ABSTRACT

The rise of Web 2.0 is signaled by sites such as Flickr, 
del.icio.us, and YouTube, and social tagging is essential to 
their success. A typical tagging action involves three com- 
ponents, user, item (e.g., photos in Flickr), and tags (i.e., 
words or phrases). Analyzing how tags are assigned by cer- 
tain users to certain items has important implications in 
helping users search for desired information. In this pa- 
per, we explore common analysis tasks and propose a dual 
mining framework for social tagging behavior mining. This 
framework is centered around two opposing measures, sim- 
ilarity and diversity, being applied to one or more tagging 
components, and therefore enables a wide range of analy- 
sis scenarios such as characterizing similar users tagging di- 
verse items with similar tags, or diverse users tagging similar 
items with diverse tags, etc. By adopting different concrete 
measures for similarity and diversity in the framework, we 
show that a wide range of concrete analysis problems can 
be defined and they are NP-Complete in general. We de- 
sign efficient algorithms for solving many of those problems 
and demonstrate, through comprehensive experiments over 
real data, that our algorithms significantly out-perform the 
exact brute-force approach without compromising analysis 
result quality. 

1. INTRODUCTION

Tagging is a core activity on the social web. It reflects
a wide range of content interpretations and serves many
purposes, ranging from bookmarking websites in del.icio.us
and organizing personal videos in YouTube to characterizing
movies in MovieLens. While one can possibly examine tags
used by a single user on a single item, it is easy to see that
the task quickly becomes intractable for a collection of
tagging actions involving multiple users and items. In this
paper, we aim to formalize the analysis of the tagging behavior
of a set of users for a set of items and develop appropriate
algorithms to complete that task.

A typical tagging action involves three components, user, 
item, and tag. We propose to study a variety of analysis 
tasks that involve applying two alternative measures, sim- 
ilarity and diversity, to those components and producing 
groups of similar or diverse items, tagged by groups of simi- 
lar or diverse users with similar or diverse tags. For example, 
one possible analysis outcome would be: "teenagers use di- 
verse tags for action movies" or "males from New York and 
California use similar tags for movies directed by Cameron 
and Spielberg". In Sections 2.1 and 2.2, we will describe some
of these problem instances that are enabled in our frame- 
work. A general dual mining framework that encompasses 
many common analysis tasks is then defined in Section 2.3. 

A core challenge in this dual mining framework is the de- 
sign of similarity and diversity measures. For user or item 
components, defined by (attribute, value) pairs, several ex- 
isting comparison techniques have been proposed that can 
leverage their structured nature or bipartite connections. 
Section 2.1.1 illustrates some of those techniques. 

Comparing similarity and diversity of tags used by various 
users on different items, however, presents a new challenge. 
First, tags are drawn from a much larger vocabulary than 
user or item attributes and exhibit a long tail characteristic. 
Second, it is often the case that different tags are used for the 
same set of items and, accounting for those tags separately 
would not capture their co-usage. Finally, tags may have 
linguistic connections such as synonymy. In order to capture 
tag similarity and diversity, we propose to summarize tags 
first to account for their co-usage and semantic relationships. 
Section 2.1.2 describes some techniques from Information 
Retrieval and Machine Learning that can be used. 

The tag component is also the most interesting among 
the three to be analyzed. Figure 1 shows a rendering of a 
tag summarization for Woody Allen movies in the form of 
a tag cloud. Similarly, Figure 2 shows a summarization of 
tags for the same movies from California users only. In both 
cases, summarization is defined as a simple frequency-based
tag cloud where the size of a tag corresponds to how often it 
has been used on those movies. While "Woody" and "Allen" 
are not surprisingly common to both, the two clouds are dif- 
ferent: all users highlight the dramatic, tragic and disturbing 
nature of those movies, and California users emphasize tags 
such as classic and psychiatry. Moreover, one of the director's
popular movies, Noiva Nervosa, is prominent in the tag
cloud of all users, and yet is conspicuously absent in that 
of California users. Our goal is to define analysis tasks that 
can help users easily spot those interesting patterns and use 
that knowledge in subsequent actions. 

We emphasize that, in this study, it is not our goal to 
advocate one particular similarity or diversity measure over 
another. Rather, we focus on formalizing the Tagging Be- 
havior Dual Mining framework and the problem defini- 
tions, and designing general algorithms that will work well 
for most measures. 

The analysis problems formally defined in our proposed 
framework fall into the wider category of constrained opti- 
mization problems. We are looking for groups of tagging 
actions that achieve maximum similarity or diversity on one 
or more components while satisfying a set of constraints 
such as support. Not surprisingly, as our complexity anal- 
ysis shows in Section 3, those problems are NP-Complete 
in general. We propose two sets of efficient algorithms for 
solving them. The first set incorporates Locality Sensitive 
Hashing (LSH) and can be used for problems maximizing 
tagging action component similarity. While traditional LSH 
is frequently used for performing nearest neighbor search 
in high-dimensional spaces, our algorithm finds the bucket 
containing the result set of our tagging behavior analysis. 
The second set of algorithms borrows ideas from techniques 
employed in Computational Geometry to handle the Facil- 
ity Dispersion Problem (FDP) and is effective for problems 
maximizing diversity. Both sets of algorithms possess com- 
pelling theoretical characteristics for problem instances op- 
timizing the dual mining goal without any constraints. For 
both sets, we also propose advanced techniques that return 
better quality results in comparable running time. 

In summary, we make the following main contributions: 

• We formalize the task of analyzing the tagging behav- 
ior of a set of users for a set of items and propose a 
novel general constrained optimization framework for 
tagging behavior mining. 

• We show that the tagging analysis problems are NP- 
Complete and propose efficient algorithms for solving 
the problems. 

• We develop locality sensitive hashing based algorithms 
for solving problems maximizing tagging action com- 
ponent similarity. We also design computational ge- 
ometry based algorithms for problem instances max- 
imizing diversity. We provide theoretical guarantees 
for both sets of algorithms for handling problems opti- 
mizing the dual mining goal without any constraints. 



• We perform detailed experiments on real data to show 
that our proposed algorithms generate equally good re- 
sults as exact brute-force in much less execution time. 

2. THE TAGDM FRAMEWORK 

We model the data on a social tagging site as a triple
$(U, I, \mathcal{T})$, representing the set of users, the set of items
and the tag vocabulary, respectively. Each tagging action
can be considered as a triple itself, represented as $(u, i, T)$,
where $u \in U$, $i \in I$, and $T \subseteq \mathcal{T}$. A group of tagging
actions is denoted as $g = \{(u_1, i_1, T_1), (u_2, i_2, T_2), \ldots\}$.
We define a user schema, $S_U = (a_1, a_2, \ldots)$, to represent
each user as a set of attribute values conforming to
the user schema: $u = (u.a_1, u.a_2, \ldots)$, where each $u.a_x$
is a value for the attribute $a_x \in S_U$. For example, let
$S_U$ = (age, gender, state, city); a user can be represented
as (18, student, new york, nyc). Similarly, we define an item
schema, $S_I = (a_1, a_2, \ldots)$, to represent each item as a set of
attribute values, $i = (i.a_1, i.a_2, \ldots)$, where each $i.a_y$ is a
value for the attribute $a_y \in S_I$.

Each tagging action therefore can be represented
as an expanded tuple that concatenates the user attributes,
the item attributes and the tags: $r = (r_u.a_1,
r_u.a_2, \ldots, r_i.a_1, r_i.a_2, \ldots, T)$. $G$ denotes the set of all such
tagging action tuples. Many social sites have hundreds of
millions of such tuples. Most, if not all, mining tasks involve
analyzing sets of such tuples collectively. While there
are a number of different ways tagging action tuples can
be grouped, we adopt the view proposed and experimentally
verified in [6], where groups of users (or items) that are
structurally describable (i.e., sharing common attribute value
pairs) are meaningful to end-users. Such groups correspond
to conjunctive predicates on user or item attributes. An example
of a user describable tagging action group is {gender=
male, state=new york}, and of an item describable group is
{genre=comedy, director=woody allen}. Next we define
an essential characteristic of a set of tagging action groups.

Definition 1. Group Support. Given the input set
of tagging action tuples $G$, the support of a set of tagging
action groups $\mathcal{G} = \{g_1, g_2, \ldots\}$ over $G$ is defined as
$\mathit{Support}^{G}_{\mathcal{G}} = |\{r \in G \mid \exists g_x \in \mathcal{G}, r \in g_x\}|$. Intuitively, group
support measures the number of input tagging action tuples
that belong to at least one of the groups in $\mathcal{G}$.
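
Definition 1 can be illustrated with a minimal Python sketch. It assumes that each tagging action tuple is a dictionary of attribute values and that each group is given by its conjunctive description; the function names and the sample data are hypothetical and do not come from the paper.

```python
def matches(action, description):
    """True if the tagging action satisfies every (attribute, value) pair."""
    return all(action.get(attr) == val for attr, val in description.items())


def group_support(actions, descriptions):
    """Count the input tuples that belong to at least one described group."""
    return sum(
        1 for action in actions
        if any(matches(action, desc) for desc in descriptions)
    )


# Hypothetical tagging action tuples (user attributes, item attributes, tags).
actions = [
    {"gender": "male", "state": "new york", "genre": "comedy", "tags": {"funny"}},
    {"gender": "female", "state": "new york", "genre": "comedy", "tags": {"witty"}},
    {"gender": "male", "state": "texas", "genre": "action", "tags": {"gun"}},
]
groups = [
    {"gender": "male", "state": "new york"},  # user describable group
    {"genre": "comedy"},                      # item describable group
]
support = group_support(actions, groups)
```

A tuple that falls into several groups is counted once, which is why the support is defined over the union of the groups rather than as a sum of group sizes.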

2.1 Concrete TagDM Problems 

A large number of concrete TagDM problem instances can 
be defined, with their variations coming from two main as- 
pects. The first category of variations depends on which 
measure, similarity or diversity, the user is interested in applying
to which tagging components (i.e., users, items, or
tags). For example, a user can be interested in identifying 
similar tags produced by similar user groups on diverse item 
groups, or similar tags produced by diverse user groups on 
similar item groups. Since there are three components, each 
of which can adopt one of two measures, this variation alone 
can lead to $2^3 = 8$ different problem instances.

The second category of variations depends on which com- 
ponents the user is adding to the optimization goal and 
which components the user is adding to the constraints. 
For example, a user can be interested in finding tagging 
action groups that maximize a tag diversity measure and 
satisfy user and item similarity constraints, or groups that 
maximize a combination of tag diversity and user diversity 
measures and satisfy an item similarity constraint. Since 
each component can be part of the optimization goal, or 
part of the constraint, or neither, this variation can lead to 
$3^3 - 1 = 26$ different problem instances.

Combining both categories of variations, there is a total 
of 112 concrete problem instances that our framework captures!
Table 1 illustrates six of the problem instantiations
that we have investigated in detail. In particular, we focus 
on problems with all three components with constraints on 
user and item and optimization on the tag component, since 
those are the most novel and intuitive mining problems. 
Table 1: Concrete TagDM Problem Instantiations.
Column C lists the constraint dimensions and col- 
umn O lists the optimization dimensions. 



Before we formalize the mining problems, we introduce 
the core concept of Dual Mining Function. 

Definition 2. Dual Mining Function. A Dual Mining
Function, $F : \mathcal{G} \times b \times m \to \mathit{float}$, takes as inputs: $\mathcal{G}$,
a set of tagging action groups; $b \in \{\text{users}, \text{items}, \text{tags}\}$, a
tagging behavior dimension; $m \in \{\text{similarity}, \text{diversity}\}$,
a dual mining criterion; and produces a float score, $s$, that
quantifies the mining criterion over the particular dimension
for the set of tagging action groups.

Definition 2 defines a general dual mining function that 
computes a score using arbitrary evaluations over the tag- 
ging action groups. In practice, there is a subset of dual 
mining functions that are more restricted and yet powerful 
enough for solving many real mining scenarios: 

Definition 3. Pair-Wise Aggregation Dual Mining
Function. A Pair-Wise Aggregation (PA) Dual Mining
Function, $F_{pa} : \mathcal{G} \times b \times m \to \mathit{float}$, is a dual mining function
with two component functions, $F_p : g_i \times g_j \times b \times m \to \mathit{float}$
and $F_a : \{s_1, s_2, \ldots\} \to \mathit{float}$, where $(g_i, g_j)$ is a pair
of distinct tagging action groups and each $s_i$ is an intermediate
score produced by $F_p$, such that: $F_{pa}(\mathcal{G}, b, m) =
F_a(\{F_p(g_i, g_j, b, m)\})$, $\forall g_i, g_j \in \mathcal{G}, i \neq j$.

Pair-wise dual mining functions simplify the general dual
mining functions by enabling the overall mining score to be
computed via aggregating the scores computed over pairs
of the tagging action groups, which is often much easier to
define and compute. We now present a few examples of
the pair-wise dual mining function. The key to a pair-wise
dual mining function is the pair-wise comparison function,
$F_p(g_1, g_2, b, m)$, where $g_1$ and $g_2$ are distinct tagging action
groups, $b \in \{\text{users}, \text{items}, \text{tags}\}$ is a tagging behavior
dimension, and $m \in \{\text{similarity}, \text{diversity}\}$ is a dual
mining criterion.
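
The two-stage structure of Definition 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the dimension $b$ and criterion $m$ are taken to be fixed inside the pair-wise function, the pair-wise function is assumed symmetric (so unordered pairs suffice), the aggregation is assumed to be the mean, and at least two groups are assumed to be given. The names `pairwise_aggregate` and `overlap` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pairwise_aggregate(groups, pair_fn, agg_fn=mean):
    """Apply pair_fn (F_p) to every pair of distinct groups, then aggregate (F_a)."""
    scores = [pair_fn(g_i, g_j) for g_i, g_j in combinations(groups, 2)]
    return agg_fn(scores)


# Hypothetical F_p: each group is reduced to a set of tags, and the pair-wise
# score is the overlap ratio of the two sets.
def overlap(g_i, g_j):
    return len(g_i & g_j) / len(g_i | g_j)


groups = [{"drama", "friendship"}, {"drama", "comedy"}, {"drama", "friendship"}]
score = pairwise_aggregate(groups, overlap)
```

Swapping `agg_fn` for `min` or `sum` gives other aggregation choices without touching the pair-wise comparison.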

2.1.1 User & Item Dimensions Dual Mining 

Given a user describable tagging action group$^1$, its user
dimension is effectively its user group description, i.e., a set
of (attribute, value) pairs that describes the group. Therefore,
given two user groups, $g_1$ and $g_2$, their similarity or
diversity can be captured mainly in two ways: 1) structural
distance between the user group descriptions and 2) set
distance based on the items they have rated.

Let $A$ be the set of user attributes shared between two
user describable tagging action groups $g_1$ and $g_2$. An example
of the pair-wise comparison function leveraging structural
distance is the following:

$F_p(g_1, g_2, \text{users}, \text{similarity}) = \sum_{a \in A} \mathit{sim}(a.v_1, a.v_2)$

where $a.v_1$ and $a.v_2$ belong to the sets of user attribute value
pairs of $g_1$ and $g_2$, respectively, and $\mathit{sim}$ can be a string
similarity function that simply
computes the edit distance between two values or a more sophisticated
similarity function that takes domain knowledge
into consideration. For example, a domain-aware similarity
function can determine "New York City" to be more similar
to "Boston" than to "Dallas". $F_p(g_1, g_2, \text{users}, \text{diversity})$
can be similarly defined using the inverse function.
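
A minimal Python sketch of the structural comparison, assuming group descriptions are dictionaries from attribute to value. The exact-match `sim` used here is a deliberately simple stand-in for the edit-distance or domain-aware functions mentioned above; all names are hypothetical.

```python
def exact_match(v1, v2):
    """Simplest possible sim: 1.0 when the two values are equal, else 0.0."""
    return 1.0 if v1 == v2 else 0.0


def structural_similarity(desc1, desc2, sim=exact_match):
    """Sum sim over the attributes shared by two group descriptions."""
    shared = desc1.keys() & desc2.keys()
    return sum(sim(desc1[a], desc2[a]) for a in shared)


g1 = {"gender": "male", "age": "young", "state": "new york"}
g2 = {"gender": "male", "age": "teen", "city": "nyc"}
score = structural_similarity(g1, g2)
```

Only `gender` and `age` are shared by the two descriptions, so only those two attributes contribute to the sum.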

Let $g_1.I$ and $g_2.I$ be the sets of items tagged by tuples in $g_1$
and $g_2$, respectively. An example of the pair-wise comparison
function leveraging set distance is the following:

$F_p(g_1, g_2, \text{users}, \text{similarity}) = \frac{|g_1.I \cap g_2.I|}{|g_1.I \cup g_2.I|}$

which simply computes the percentage of items tagged by
both groups (akin to the Jaccard coefficient). If numerical ratings
are available for each tagging tuple, a more sophisticated set
distance similarity function can further impose an additional
constraint that an item is common to both groups if its average
ratings in both are close. $F_p(g_1, g_2, \text{users}, \text{diversity})$
can be similarly defined using the inverse function.
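
The set-based comparison can be sketched in a few lines of Python. The sketch assumes that item sets are Python sets of item identifiers and that the "inverse" used for diversity is the complement of the similarity score; that choice and the function names are assumptions made for illustration.

```python
def item_set_similarity(items1, items2):
    """Fraction of items tagged by both groups among items tagged by either."""
    union = items1 | items2
    if not union:
        return 0.0  # two empty groups share nothing
    return len(items1 & items2) / len(union)


def item_set_diversity(items1, items2):
    """One possible inverse: the complement of the similarity score."""
    return 1.0 - item_set_similarity(items1, items2)


sim_score = item_set_similarity({"a", "b", "c"}, {"b", "c", "d"})
div_score = item_set_diversity({"a", "b", "c"}, {"b", "c", "d"})
```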

2.1.2 Tag Dimension Dual Mining 

The tag dimension is fundamentally different from the 
user and item dimensions. First, there is no fixed set of 
attributes associated with the tag dimension, therefore the 
structural distance does not apply. Second, tags are chosen 
freely by users using diverse vocabularies. As a result, a 
single tagging action group can contain a large number of 
tags. Both characteristics make comparing two sets of tags 
very difficult. 

We propose a two-step approach for handling the tag di- 
mension. First, we propose an initial step to summarize the 
set of all tags of a tagging action group into a smaller rep- 
resentative set of tags, called group tag signature. Second, 
we apply comparison functions to compute distance between 
signatures. Once again, we are not advocating any partic- 
ular way of producing signatures and/or comparing them. 
Rather, we simply argue for the need for tag signatures and 
their comparisons. 

1 Since the user and item dimensions share the same charac- 
teristics in the dual mining framework, we present here only 
the user dimension for simplicity. 
Group tag signature generation: Given a group
of tagging actions $g = \{(u_1, i_1, T_1), (u_2, i_2, T_2), \ldots\}$, we
aim to summarize the tags in $T_1 \cup T_2 \cup \ldots$ into a
tag signature $T_{rep}(g)$. The general form of $T_{rep}(g)$ is
$\{(tc_1, w_1), (tc_2, w_2), \ldots\}$, where $tc_j$ is a topic category (which can be
a tag itself) and $w_j$ is its weight, i.e., the relevance of $g$ for $tc_j$.

One can define several methods to compute tag signatures.
For example, when tags are hand-picked by editors
and hence the number of unique tags is small, a simple definition
can be $T_{rep}(g) = \{(t, \mathit{freq}(t)) \mid t \in T_1 \cup T_2 \cup \ldots\}$,
where $\mathit{freq}(t)$ computes how many times $t$ is used in $g$.
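
The frequency-based signature can be sketched as follows, assuming a group is a list of (user, item, tag set) triples. The function name and the sample group are hypothetical.

```python
from collections import Counter


def frequency_signature(group):
    """Map each tag used in the group to the number of actions that use it."""
    counts = Counter()
    for _user, _item, tags in group:
        counts.update(tags)  # each tag in the action's tag set adds one
    return dict(counts)


# Hypothetical group of three tagging actions.
group = [
    ("u1", "annie hall", {"classic", "comedy"}),
    ("u2", "annie hall", {"classic"}),
    ("u3", "manhattan", {"classic", "new york"}),
]
signature = frequency_signature(group)
```

Rendering such a signature with font sizes proportional to the counts yields the kind of frequency-based tag cloud discussed in the introduction.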

Most collaborative tagging sites encourage users to create
their own tags, thereby creating a long tail of tags. This
raises challenges such as sparsity and the choice of different
tags to express similar meanings. Techniques from Information
Retrieval and Machine Learning such as tf*idf and
Latent Dirichlet Allocation (LDA) can be used for tag
summarization. LDA aggregates tags into topics based on their
co-occurrence, reasons at the level of topics, and handles
long tail issues [2]. Also, a Web service such as Open
Calais$^2$ can be used to match a set of tags to a set of
predefined categories through sophisticated language analysis
and information extraction.

Comparing group tag signatures: When tagging action
groups are represented as tag signatures over the same
set of topics, we can leverage many existing vector comparison
functions to compute the pairwise distance between any two
group tag signature vectors. An example is simply
cosine similarity:

$F_p(g_1, g_2, \text{tags}, \text{similarity}) = \cos(\theta(T_{rep}(g_1), T_{rep}(g_2)))$,

where $\theta$ is the angle between the two vectors.
$F_p(g_1, g_2, \text{tags}, \text{diversity})$ can be defined similarly.

The comparison can also be enhanced by using an ontol- 
ogy such as WordNet to compare entries of similar topics. 
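
A minimal Python sketch of the signature comparison, assuming signatures are sparse dictionaries from topic to weight over a shared topic set. Taking diversity as one minus the cosine similarity is an assumption made for illustration, as are the function names.

```python
import math


def cosine_similarity(sig1, sig2):
    """Cosine of the angle between two sparse signatures (topic -> weight)."""
    dot = sum(w * sig2.get(topic, 0.0) for topic, w in sig1.items())
    norm1 = math.sqrt(sum(w * w for w in sig1.values()))
    norm2 = math.sqrt(sum(w * w for w in sig2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an empty signature is treated as unrelated to everything
    return dot / (norm1 * norm2)


def cosine_diversity(sig1, sig2):
    """One possible diversity counterpart: 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(sig1, sig2)
```

Because the cosine depends only on direction, a group that uses the same tags in the same proportions but ten times as often has similarity 1 with the smaller group.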

2.2 Common Problem Instances 

We are now ready to define two of the concrete dual min- 
ing problems listed in Table 1. The first one aims to find 
similar user sub-populations who agree most on their tag- 
ging behavior for a diverse set of items. The second one 
aims to find diverse user sub-populations who disagree most 
on their tagging behavior for a similar set of items. 

Problem 1. Identify a set of tagging action groups,
$G^{opt} = \{g_1, g_2, \ldots\}$, such that:

• $\forall g_x \in G^{opt}$, $g_x$ is user- and/or item-describable;

• $1 \le |G^{opt}| \le k$;

• $\mathit{Support}^{G}_{G^{opt}} \ge p$;

• $F_1(G^{opt}, \text{users}, \text{similarity}) \ge q$;

• $F_2(G^{opt}, \text{items}, \text{diversity}) \ge r$;

• $F_3(G^{opt}, \text{tags}, \text{similarity})$ is maximized.

where $F_1$ and $F_2$ are structural similarity based dual mining
functions as described in Section 2.1.1,
and $F_3$ is the LDA based tag dual mining function as
described in Section 2.1.2.

For $k = 2$, $p = 100$, $q = 0.5$, and $r = 0.5$, solving the
problem on the full set of tagging action tuples in MovieLens [5]
can give us the following $G^{opt}$:

$g_1$ = {(gender, male), (age, young), (actor, j.aniston),
(comedy, drama, friendship)}
$g_2$ = {(gender, male), (age, young), (actor, j.timberlake),
(drama, friendship)}

2 https://www.opencalais.com/



which illustrates the interesting pattern that young male
users assign similar tags, drama and friendship, to movies 
with "Jennifer Aniston" and "Justin Timberlake," the for- 
mer for her involvement in the popular TV show "Friends" 
and the latter for his movie "The Social Network." 

A problem closely related to Problem 1 inverts the
similarity and diversity constraints for the user and item
components, i.e., finding diverse user sub-populations who 
agree most on their tagging behavior for a similar set of 
items (Problem 3 in Table 1). Both problems focus on opti- 
mizing the tag similarity and therefore can be solved using 
similar techniques. Next, we define a problem that aims to 
identify groups that disagree on their tagging behavior. 

Problem 4. Identify a set of tagging action groups,
$G^{opt} = \{g_1, g_2, \ldots\}$, such that:

• $\forall g_x \in G^{opt}$, $g_x$ is user- and/or item-describable;

• $1 \le |G^{opt}| \le k$;

• $\mathit{Support}^{G}_{G^{opt}} \ge p$;

• $F_1(G^{opt}, \text{users}, \text{diversity}) \ge q$;

• $F_2(G^{opt}, \text{items}, \text{similarity}) \ge r$;

• $F_3(G^{opt}, \text{tags}, \text{diversity})$ is maximized.

where $F_1$, $F_2$, and $F_3$ are similarly defined as in Problem 1.

For $k = 2$, $p = 100$, $q = 0.5$, and $r = 0.5$, solving the
problem on the full set of tagging action tuples in MovieLens
can give us the following $G^{opt}$:

$g_1$ = {(gender, male), (age, teen), (genre, action),
(gun, special effects)}
$g_2$ = {(gender, female), (age, teen), (genre, action),
(violence, gory)}

which illustrates teenaged male users and female users have 
entirely different perspectives on action movies. This gives 
a user a new insight that there is something about action 
movies that is causing the different reactions among two 
different groups of users. 

2.3 Generalizing the TagDM Framework 

We take the novel approach of proposing a general con- 
strained optimization framework for tagging behavior min- 
ing, upon which various analysis tasks can be instantiated 
and optimized. 

Definition 4. Tagging Behavior Dual Mining
(TagDM) Problem. Given a triple $(G, C, O)$ in the
TagDM framework where $G$ is the input set of tagging
actions and $C$, $O$ are the sets of constraints and optimization
criteria respectively, the Tagging Behavior Dual Mining
problem is to identify a set of tagging action groups,
$G^{opt} = \{g_1, g_2, \ldots\}$, for $b \in \{\text{users}, \text{items}, \text{tags}\}$ and $m \in
\{\text{similarity}, \text{diversity}\}$, such that:

• $\forall g_x \in G^{opt}$, $g_x$ is user- and/or item-describable;

• $k_{lo} \le |G^{opt}| \le k_{hi}$;

• $\mathit{Support}^{G}_{G^{opt}} \ge p$;

• $\forall c_i \in C$, $c_i.F(G^{opt}, b, m) \ge \text{threshold}$;

• $\sum_{o_j \in O} o_j.F(G^{opt}, b, m)$ is maximized.

Intuitively, TagDM aims to identify a set of user- and/or 
item-describable sub-groups from input tagging actions, 
such that the dual mining constraints are satisfied and a 
dual mining goal is optimized. We now clearly see how this 
framework generalizes the common problem instances given 
in Section 2.2. 






3. COMPLEXITY ANALYSIS 

In this section we provide the proof that the Tagging Behavior
Dual Mining problem is NP-Complete. The decision
version of the TagDM problem is defined as follows:

Given a triple $(G, C, O)$, is there a set of tagging action
groups $G^{opt} = \{g_1, g_2, \ldots\}$ such that $\sum_{o_j \in O} (o_j.Wt \times
o_j.F(G^{opt}, o_j.D, o_j.M)) \ge \alpha$ subject to:

• $\forall g_x \in G^{opt}$, $g_x$ is user- and/or item-describable.

• $k_{lo} \le |G^{opt}| \le k_{hi}$

• $\mathit{Support}^{G}_{G^{opt}} \ge p$

• $\forall c_i \in C$, $c_i.F(G^{opt}, c_i.D, c_i.M) \ge c_i.Th$

Theorem 1. The decision version of the TagDM problem 
is NP-Complete. 

Proof. The membership of the decision version of the TagDM
problem in NP is obvious. To verify NP-Completeness, we
reduce the Complete Bipartite Subgraph problem (CBS) to our
problem and argue that a solution to CBS exists if and only
if a solution to our instance of TagDM exists. First, we show
that the problem CBS is NP-Complete.

Lemma 1. Complete bipartite subgraph problem (CBS) is 
NP-Complete. 

Proof. The decision version of CBS is defined as follows:
Given a bipartite graph $G' = (V_1, V_2, E)$ and two positive
integers $n_1 \le |V_1|$, $n_2 \le |V_2|$, are there two disjoint subsets
$V_1' \subseteq V_1$, $V_2' \subseteq V_2$ such that $|V_1'| = n_1$, $|V_2'| = n_2$ and $u \in
V_1'$, $v \in V_2'$ implies that $\{u, v\} \in E$?

The membership of CBS in NP is obvious. We verify
the NP-Completeness of the problem by reducing the
Balanced Complete Bipartite Subgraph (BCBS) problem to it.
BCBS is defined as: Given a bipartite graph $G'' =
(V_1'', V_2'', E'')$ and a positive integer $n''$, find two disjoint
subsets $V_1''' \subseteq V_1''$, $V_2''' \subseteq V_2''$ such that $|V_1'''| = |V_2'''| = n''$ and
$u \in V_1'''$, $v \in V_2'''$ implies that $\{u, v\} \in E''$. This problem
was proved to be NP-Complete by reduction from Clique
in [14]. We can reduce BCBS to CBS by passing the input
graph $G'' = (V_1'', V_2'', E'')$ of BCBS to CBS and setting $n_1$ and
$n_2$ to $n''$. If a solution exists for the CBS instance, then the
disjoint subsets $V_1'''$, $V_2'''$ form a balanced complete bipartite
subgraph in $G''$. □

We have already established that the TagDM problem is in
NP. To verify its NP-Completeness, we reduce CBS to the
decision version of our problem. Given an instance of
the problem CBS with $G' = (V_1, V_2, E)$ and positive
integers $n_1, n_2$, we construct an instance of the TagDM
problem such that there exists a complete bipartite subgraph
induced by disjoint vertex subsets $V_1' \subseteq V_1$, $V_2' \subseteq V_2$ with
$|V_1'| = n_1$, $|V_2'| = n_2$, if and only if a solution to our
instance of TagDM exists.

First, we create a user schema $S_U = (a_1, a_2, \ldots, a_{|V_2|})$
such that for each vertex $v_j \in V_2$, there exists a corresponding
user attribute $a_j \in S_U$. Next, we define a set of users
$U = \{u_1, u_2, \ldots, u_{|V_1|}\}$. Again, for each vertex $v_i \in V_1$ there
exists a corresponding user $u_i \in U$.

For all pairs of vertices $(v_i, v_j)$, $v_i \in V_1$, $v_j \in V_2$, we set
$u_i.a_j$ to 1 if $\{v_i, v_j\} \in E$; else, we set it to a unique value
such that $u_{x_1}.a_{y_1} \neq u_{x_2}.a_{y_2}$ unless $x_1 = x_2$, $y_1 = y_2$. Intuitively,
we set the $j$-th attribute of the $i$-th user to 1 if an edge
exists between the vertex pair $(v_i, v_j)$; else, we set it to a unique
value that is not shared with any attribute of any user. One
possible way to assign the unique attribute values is to pick
a previously unassigned value from the set $[2, |V_1| \times |V_2| + 1]$.
Since the number of possible edges is at most $|V_1| \times |V_2|$, this
set suffices to generate unique attribute values.
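
The attribute assignment used in the reduction can be made concrete with a small Python sketch. It assumes vertex names are distinct across $V_1$ and $V_2$ and that edges are given as pairs; the function names and the example graph are hypothetical.

```python
def build_user_table(v1, v2, edges):
    """Build u_i.a_j from a bipartite graph: 1 for an edge, a fresh value otherwise."""
    edge_set = {frozenset(e) for e in edges}
    next_unique = 2  # unique values are drawn from [2, |V1| * |V2| + 1]
    table = {}
    for vi in v1:  # one user per vertex of V1
        row = {}
        for vj in v2:  # one attribute per vertex of V2
            if frozenset((vi, vj)) in edge_set:
                row[vj] = 1
            else:
                row[vj] = next_unique
                next_unique += 1
        table[vi] = row
    return table


def shared_attributes(table, u, w):
    """Attributes on which two users hold the same value (only 1 can be shared)."""
    return {a for a in table[u] if table[u][a] == table[w][a]}


table = build_user_table(
    ["x1", "x2"],
    ["y1", "y2", "y3"],
    [("x1", "y1"), ("x1", "y2"), ("x2", "y2"), ("x2", "y3")],
)
common = shared_attributes(table, "x1", "x2")
```

In this example both users are adjacent only to `y2`, so `y2` is the single attribute on which they agree, mirroring a complete bipartite subgraph on $\{x_1, x_2\}$ and $\{y_2\}$.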

We construct an instance of the TagDM problem where
$I = \{i\}$ and $\mathcal{T} = \{t\}$. This results in a set of tagging actions,
$G = \{(u_1, i, t), \ldots, (u_{|V_1|}, i, t)\}$, where only the user dimension
plays a non-trivial role in determining the problem solution.
Given a pair of users, the pairwise similarity function
$F_1$ on the user dimension measures their structural similarity
by counting the number of attribute values that are shared
between them. Intuitively, the problem collapses to that of
finding a subset of users who share a subset of attributes.

We then define our TagDM problem instance as: For a
given triple $(G, C, O)$, identify a set $G^{opt}$ of tagging action
groups such that $F_3(G^{opt}, \text{tags}, m) \ge 0$ subject to:

• $1 \le |G^{opt}| \le n_1$

• $\mathit{Support}^{G}_{G^{opt}} \ge n_1$

• $F_1(G^{opt}, \text{users}, \text{similarity}) \ge n_2 \times \binom{n_1}{2}$

If there exists a solution to this TagDM problem instance,
then there are $n_1$ users who have identical values for at least
$n_2$ of their attributes. If two users $u_x$ and $u_y$ have the same
values for a set of attributes $A$, then for all attributes $a \in A$,
$u_x.a = u_y.a = 1$. In other words, whenever the attributes
of two users overlap, the shared attributes can only take a
value of 1. Any other symbol that was assigned is unique
and cannot overlap by construction. If there exists a subset
of attributes $A' \subseteq S_U$ and a subset of users $U' \subseteq U$ such that
every user in $U'$ has the value 1 on every attribute in $A'$, then
the corresponding vertices in $V_1$ and $V_2$ form a complete
bipartite subgraph solving the input instance of CBS. Thus
the TagDM problem is NP-Complete. □

3.1 Exact Algorithm 

A brute-force exhaustive approach (henceforth, referred to 
as Exact) to solve the TagDM problem requires us to enu- 
merate all possible combinations of tagging action groups in 
order to return the optimal set of groups maximizing the 
mining criterion and satisfying the constraints. The num- 
ber of possible candidate sets is exponential in the num- 
ber of groups. Evaluating the constraints on each of the 
candidate sets and selecting the optimal result can thus be 
prohibitively expensive. Each tagging action group is asso- 
ciated with a group tag signature vector (the size of which 
is determined by the cardinality of the global set of topics), 
which may introduce additional challenges in the form of 
higher dimensionality. Therefore, we develop practical and 
efficient algorithms. 
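
The exhaustive strategy can be sketched in Python as follows. This is an illustrative sketch rather than the paper's implementation: it assumes the objective and the constraints are supplied as callables over a candidate tuple of groups, and the names `exact`, `objective`, and `constraints` are hypothetical.

```python
from itertools import combinations


def exact(groups, k, objective, constraints=()):
    """Enumerate every candidate set of 1..k groups and keep the best feasible one."""
    best_set, best_score = None, float("-inf")
    for size in range(1, k + 1):
        for candidate in combinations(groups, size):
            # Skip candidates that violate any hard constraint.
            if all(check(candidate) for check in constraints):
                score = objective(candidate)
                if score > best_score:
                    best_set, best_score = candidate, score
    return best_set, best_score


# Toy usage: "groups" are numbers, the objective is their sum, and the
# single constraint requires at least two groups in the result.
result = exact([3, 1, 4], 2, sum, [lambda c: len(c) >= 2])
```

With $n$ groups the loop visits $\sum_{s=1}^{k} \binom{n}{s}$ candidate sets, which is the source of the exponential cost noted above.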

We develop two sets of algorithms. The first set comprises
locality sensitive hashing based algorithms for handling
TagDM problem instances maximizing similarity of tagging 
action components. The algorithms are efficient in practice, 
but cannot handle TagDM problem instances maximizing 
diversity. The second set is based on techniques employed in 
computational geometry for the facility dispersion problem 
and is our solution for diversity mining problem instances. 

4. LSH BASED ALGORITHMS 

The first of our algorithmic solutions is based on locality
sensitive hashing (LSH), which is a popular technique
to solve nearest neighbor search problems in higher
dimensions [13]. The basic idea is to hash similar input items
into the same bucket (i.e., uniquely definable hash signature)
with high probability. It performs probabilistic dimension
reduction of high dimensional data by projecting
input items in a higher dimension to a lower dimension such
that items that were in close proximity in the higher dimension
get mapped to the same item in the lower dimensional
space with high probability. LSH guarantees a lower bound
on the probability that two similar input items fall into the
same bucket in the projected space and also an upper bound
on the probability that two dissimilar vectors fall into the
same bucket. For any pair of points in a high-dimensional
space, $P_1$ is the probability of two close points falling into
the same bucket and $P_2$ is the probability of two far-apart
points falling into the same bucket; we want $P_1 > P_2$. If
input items are projected from higher dimension $d$ to a lower
dimension $d'$, the probabilities can be bounded by:



$P(\text{similar items colliding}) \ge (1 - P_1)^{d'}$
$P(\text{dissimilar items colliding}) \le P_2^{d'}$    (1)



This provides an approach to select a set of tagging action
groups that are similar in their tagging behavior. In our
problem, we need to compare the input set $G$ of $n$ tagging
action groups (i.e., $n$ $d$-dimensional tag signature vectors,
where $d$ is the cardinality of the global set of tag topic categories
mentioned in Section 2.1.2) using a pairwise comparison
function $F_p(g_1, g_2, \text{tags}, \text{similarity})$ that operates on
group tag signature vectors in order to optimize tag similarity.
The result set of tagging action groups $G^{opt}$ maximizing
tag similarity can be retrieved by finding the $k$ closest vectors,
i.e., those with minimum average pairwise distance between them.

Note that our LSH based algorithms work for Problems
1, 2 and 3 in Table 1, which maximize tag similarity. We first
introduce an algorithm that returns the set of tagging action
groups $G^{opt}$, $1 \le |G^{opt}| \le k$, having maximum similarity in
tagging behavior (Column O in Table 1) and then discuss
additional techniques to include the multiple hard constraints
into the solution (Column C in Table 1).

4.1 Maximizing Similarity based on LSH 

Our LSH based algorithm SM-LSH deals with TagDM
problem instances optimizing tag SiMilarity. In traditional
LSH, the buckets obtained after hashing input items are
used to find the nearest neighbors for new items. In our
solution, we instead rank the buckets based on the scoring
function. One of the key requirements for good performance
of LSH is the careful selection of the family of hashing
functions. In SM-LSH, we use the LSH scheme proposed by
Charikar [4], which employs a family of hashing functions
based on cosine similarity. As discussed in Section 2.1.2,
the cosine similarity between two tagging action group
tag signature vectors is the cosine of the angle between them:

$$\cos(\theta(T_{rep}(g_x), T_{rep}(g_y))) = \frac{T_{rep}(g_x) \cdot T_{rep}(g_y)}{\|T_{rep}(g_x)\| \, \|T_{rep}(g_y)\|}$$

The algorithm computes a succinct hash signature of the
input set of $n$ tagging action groups by computing $d'$
independent dot products of each $d$-dimensional group tag
signature vector $T_{rep}(g_x)$, where $g_x \in G$, with a random
unit vector $\vec{r}$ and retaining the sign of the $d'$ resulting
products. This maps a higher $d$-dimensional vector to a lower
$d'$-dimensional vector ($d' \ll d$). Each entry of $\vec{r}$ is
drawn from a 1-dimensional Normal distribution $N(0,1)$ with zero
mean and unit variance. Alternatively, we can generate a
spherically symmetric random vector $\vec{r}$ of unit length from
the $d$-dimensional space. The LSH function for cosine similarity
for our problem is given by the following Theorem 2 [4]:



Theorem 2. Given a collection of $n$ $d$-dimensional vectors
where each vector $T_{rep}(g_x)$ corresponds to a $g_x \in G$, and
a random unit vector $\vec{r}$ drawn from a 1-dimensional Normal
distribution $N(0,1)$, define the hash function $h_r$ as:

$$h_r(T_{rep}(g_x)) = \begin{cases} 1 & \text{if } \vec{r} \cdot T_{rep}(g_x) \ge 0 \\ 0 & \text{if } \vec{r} \cdot T_{rep}(g_x) < 0 \end{cases}$$

Then for two arbitrary vectors $T_{rep}(g_x)$ and $T_{rep}(g_y)$, the
probability that they will fall in the same bucket is

$$P[h_r(T_{rep}(g_x)) = h_r(T_{rep}(g_y))] = 1 - \frac{\theta(T_{rep}(g_x), T_{rep}(g_y))}{\pi}$$

where $\theta(T_{rep}(g_x), T_{rep}(g_y))$ is the angle between the two vectors.

The proof of Theorem 2, which establishes that the
probability of a random hyperplane (defined by $\vec{r}$ to hash
input vectors) separating two vectors is directly proportional
to the angle between the two vectors, follows from Goemans
et al.'s theorem [9]. Any pairwise dual mining function for
comparing tag signatures must satisfy such properties. We
represent the $d'$-bit LSH function as:

$$g(T_{rep}(g_x)) = [h_{r_1}(T_{rep}(g_x)), \ldots, h_{r_{d'}}(T_{rep}(g_x))]^T$$
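The signature computation described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the experiments; the function names `random_hyperplanes` and `hash_signature`, the seed, and the toy vectors are assumptions introduced here.

```python
import random


def random_hyperplanes(d, d_prime, seed=0):
    """Draw d_prime random vectors, each with d entries sampled from N(0, 1)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d_prime)]


def hash_signature(vector, hyperplanes):
    """Return the d'-bit signature: bit y is 1 if r_y . vector >= 0, else 0."""
    bits = []
    for r in hyperplanes:
        dot = sum(r_i * v_i for r_i, v_i in zip(r, vector))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


# Toy example (hypothetical values): a 4-dimensional tag signature vector
# hashed to an 8-bit signature.
planes = random_hyperplanes(d=4, d_prime=8, seed=42)
vec_a = [0.7, 0.1, 0.1, 0.1]
vec_b = [2 * v for v in vec_a]  # same direction, different length
sig_a = hash_signature(vec_a, planes)
sig_b = hash_signature(vec_b, planes)
```

Because only the sign of each dot product is retained, two vectors pointing in the same direction always receive the same signature, regardless of their lengths.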

For $d'$ LSH functions and from (1), the probability of
similar tag signature vectors falling into the same bucket for
all $d'$ hash functions is given by:

$$P(\text{similar tag vectors colliding}) \ge \left(1 - \frac{\theta(T_{rep}(g_x), T_{rep}(g_y))}{\pi}\right)^{d'}$$
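To get a feel for this bound, the following sketch evaluates $(1 - \theta/\pi)^{d'}$ for a pair of vectors. The helper names `angle_between` and `collision_lower_bound` and the toy inputs are assumptions made for illustration.

```python
import math


def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_theta)


def collision_lower_bound(u, v, d_prime):
    """(1 - theta/pi)^d': bound on all d' sign bits agreeing for u and v."""
    theta = angle_between(u, v)
    return (1.0 - theta / math.pi) ** d_prime


# Orthogonal vectors (theta = pi/2) with d' = 3 and d' = 10 hash functions.
bound_3 = collision_lower_bound([1.0, 0.0], [0.0, 1.0], d_prime=3)
bound_10 = collision_lower_bound([1.0, 0.0], [0.0, 1.0], d_prime=10)
```

For a fixed pair of vectors the value shrinks as $d'$ grows, which is why decreasing $d'$ places more groups into the same bucket.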

Now, each input vector is entered into $l$ hash tables
indexed by independently constructed hash functions
$g_1(T_{rep}(g_x)), \ldots, g_l(T_{rep}(g_x))$. Using this LSH scheme,
we hash the group tag signature vectors to $l$ different
$d'$-dimensional hash signatures (or buckets). The total number
of possible hash signatures in each of the $l$ lower dimensional
spaces is $2^{d'}$. However, the number of non-empty buckets in
each lower dimensional space is at most $n$.

While LSH is generally used to find the nearest neighbors
for new items, we take the novel approach of finding the
right bucket to output as the result of our problem by
checking the number of tagging action groups in the result
set and ranking by the scoring function. For each of the $l$ hash
tables, we first check for satisfiability of $1 \le |G^{opt}| \le k$ in
each bucket and then rank the buckets based on the scoring
function in order to determine the result set of tagging action
groups $G^{opt}$ with maximum similarity.
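The bucket selection step can be sketched as follows. This is an illustrative reading, not the authors' code: the names `best_bucket` and `avg_pairwise_similarity` are hypothetical, the scoring function is assumed to be average pairwise cosine similarity, and buckets are assumed to need at least two groups so that a pairwise score exists.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def avg_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs (needs >= 2 vectors)."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def best_bucket(signatures_per_table, vectors, k):
    """signatures_per_table: list of l dicts mapping group id -> hash signature.
    vectors: dict mapping group id -> tag signature vector.
    Returns the sorted group ids of the best-scoring admissible bucket, or None."""
    best_ids, best_score = None, -1.0
    for table in signatures_per_table:
        buckets = defaultdict(list)
        for group_id, signature in table.items():
            buckets[signature].append(group_id)
        for members in buckets.values():
            # Assumed admissibility check: between 2 and k groups in the bucket.
            if 2 <= len(members) <= k:
                score = avg_pairwise_similarity([vectors[g] for g in members])
                if score > best_score:
                    best_ids, best_score = sorted(members), score
    return best_ids


# Toy data (hypothetical): one hash table with two buckets.
toy_vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
toy_table = {"a": (1, 0), "b": (1, 0), "c": (0, 1), "d": (0, 1)}
winner = best_bucket([toy_table], toy_vectors, k=2)
```

Returning `None` when no bucket is admissible corresponds to the null result that motivates the relaxation of $d'$.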

Theorem 3. Given a collection of $n$ $d$-dimensional tag
signature vectors where each pair of vectors $T_{rep}(g_x)$ and
$T_{rep}(g_y)$ corresponds to $g_x, g_y \in G$, the probability of
finding the result set $G^{opt}$ of $k$ most similar vectors by
SM-LSH is bounded by:

$$P(G^{opt}) \ge 1 - \sum_{x,y \in [1,k]} \left[1 - \left(1 - \frac{\theta(T_{rep}(g_x), T_{rep}(g_y))}{\pi}\right)^{d'}\right]$$

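Proof. The probability of finding the set of tagging action
groups $G^{opt}$, $1 \le |G^{opt}| \le k$, having maximum similarity
in tagging behavior is:

$$\begin{aligned}
P(G^{opt}) &= 1 - P(\text{one of the } {}^{k}C_{2} \text{ vector pairs belongs to different buckets}) \\
&\ge 1 - \sum_{x,y \in [1,k]} P(T_{rep}(g_x), T_{rep}(g_y) \text{ in different buckets}) \\
&\ge 1 - \sum_{x,y \in [1,k]} \left[1 - P(T_{rep}(g_x), T_{rep}(g_y) \text{ in same bucket})\right] \\
&\ge 1 - \sum_{x,y \in [1,k]} \left[1 - \left(1 - \frac{\theta(T_{rep}(g_x), T_{rep}(g_y))}{\pi}\right)^{d'}\right] \qquad \square
\end{aligned}$$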



The above theorem establishes the theoretical probabilistic
bound of finding the optimal result set. This is a Monte
Carlo randomized algorithm whose probability of success
can be boosted by either increasing the number of hash
functions $d'$ or the number of trials of the algorithm. We also
validate the efficiency of our technique in a practical setting
in Section 6.

Algorithm 1 is the pseudo-code of our SM-LSH algorithm.
This algorithm may return a null result if post-processing of
all $l$ hash tables yields no bucket satisfying $1 \le |G^{opt}| \le k$.
This could be either because there is no set of tagging action
groups that satisfies the $\le k$ requirement of our problem
instance or because the input parameters to LSH caused
a partitioning of the data that separated the candidate set of
groups across different buckets. This motivates us to tune
SM-LSH by iterative relaxation that varies the input parameter
$d'$ in each iteration. Decreasing the parameter $d'$ increases
the expected number of tagging action groups hashing into a
bucket, thereby increasing the chances of our algorithm finding
the result set. We perform a binary search between 1 and $d'$
to identify the correct number of hash functions to employ.
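The relaxation loop can be sketched as below. It mirrors the search in Algorithm 1, which stops at the first value of $d'$ that yields a non-null result; `run_sm_lsh` is a hypothetical callable standing in for one hash-and-rank pass and is not part of the original description.

```python
def relax_d_prime(run_sm_lsh, d_prime_initial):
    """Shrink d' by binary search until one SM-LSH pass returns a result.

    run_sm_lsh(d_prime) is an assumed callable that hashes with d_prime
    hash functions and returns a result set, or None if no bucket qualifies.
    Returns (result, d_prime_used), or (None, None) if every attempt fails.
    """
    low, high = 1, d_prime_initial
    d_prime = d_prime_initial
    while low <= high:
        result = run_sm_lsh(d_prime)
        if result is not None:
            return result, d_prime
        # Null result: fewer hash functions give coarser, fuller buckets.
        high = d_prime - 1
        d_prime = (low + high) // 2
    return None, None


# Toy stand-in (hypothetical): pretend a result only appears once d' <= 3.
attempts = []


def fake_pass(d_prime):
    attempts.append(d_prime)
    return ["g1", "g2"] if d_prime <= 3 else None


outcome = relax_d_prime(fake_pass, 10)
```

Each failed pass roughly halves $d'$, so only a logarithmic number of passes is needed before either a result is found or $d'$ reaches 1.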

Complexity Analysis: The pre-processing or locality
sensitive hashing time is bounded by $O(n l d' \log n)$ since the
binary search relaxation iteration runs for $\log n$ times in the
worst case and hashing time is $O(n l d')$. In the second phase,
we post-process the buckets for ranking by scoring function,
which is an $O(n \log n)$ operation. The space complexity of
the algorithm is $O(nl)$ since there are $l$ hash tables and each
table has at most $n$ buckets.

SM-LSH is a fast algorithm with interesting probabilis- 
tic guarantees and is advantageous, especially for high- 
dimensional input vectors. However, the hard constraints 
along user and item dimensions are not leveraged in the op- 
timization solution so far. Next, we discuss LSH based ap- 
proaches for accommodating the multiple hard constraints 
into the solution. 

4.2 Dealing with Constraints: Filtering 

A straightforward method of refining the result set of
SM-LSH for satisfiability of all the hard constraints in TagDM
problem instances is post-processing, or Filtering. We refer
to this algorithm as SM-LSH-Fi. For each of the $l$ hash
tables, we first check for satisfiability of the hard constraints
in each bucket and then rank the buckets that satisfy the hard
constraints according to the scoring function in order to
determine the result set of tagging action groups $G^{app}$ with
maximum similarity (we write $G^{app}$ instead of $G^{opt}$ since
the LSH based technique now performs approximate nearest
neighbor search). Such post-processing may also return null
results if no bucket in any hash table satisfies all the
hard constraints. Therefore, we propose a smarter method
that folds the hard constraints concerning similarity into the
vectors in high-dimensional space, thereby increasing the
chances of similar groups hashing into the same bucket.

4.3 Dealing with Constraints: Folding 

Problems 2 and 3 in Table 1 have two out of the three
tagging action components to be mined for similarity. In
order to exploit the main idea of LSH, we Fold the hard
constraints maximizing similarity as soft constraints into
our SM-LSH algorithm in order to hash similar input tagging
action groups (similar with respect to group tag signature
vector and user and/or item attributes) into the same
bucket with high probability. We refer to this algorithm
as SM-LSH-Fo. We fold the user or item similarity hard
constraints in Problems 2 and 3 respectively into the
optimization goal and apply our algorithm, so that tagging



Algorithm 1 SM-LSH (G, O, k, d', l): G^opt

//Main Algorithm
1: min = 1
2: max = d'
3: T^u_rep ← {}, T^i_rep ← {}
4: if C1.m = similarity then
5:     T^u_rep ← Unarize user vector
6: end if
7: if C2.m = similarity then
8:     T^i_rep ← Unarize item vector
9: end if
10: for x = 1 to n do
11:     T_rep(g_x) ← T_rep(g_x) + T^u_rep(g_x) + T^i_rep(g_x)
12: end for
13: repeat
14:     Buckets ← LSH(G, d', l)
15:     G^opt ← MAX(Rank(Buckets, k))
16:     if G^opt = null then
17:         max = d' − 1
18:     else
19:         min = d' + 1
20:     end if
21:     d' = (min + max) / 2
22: until (min > max) or (G^opt ≠ null)
23: return G^opt

//LSH(G, d', l): Buckets
1: for z = 1 to l do
2:     for x = 1 to n do
3:         for y = 1 to d' do
4:             Randomly choose r from d-dimensional Normal distribution N(0, 1)
5:             if r · T_rep(g_x) ≥ 0 then
6:                 h_ry(T_rep(g_x)) ← 1
7:             else
8:                 h_ry(T_rep(g_x)) ← 0
9:             end if
10:             g_z(T_rep(g_x)) = [h_r1(T_rep(g_x)), ..., h_rd'(T_rep(g_x))]^T
11:         end for
12:     end for
13: end for
14: Buckets ← g_1(T_rep(g_x)) ∪ ... ∪ g_l(T_rep(g_x))
15: return Buckets



action groups with similar user attributes or similar item
attributes, and similar group tag signature vectors, hash to
the same bucket. For each tagging action group $g_x \in G$, we
represent the categorical user attributes or item attributes
as a boolean vector and concatenate it with $T_{rep}(g_x)$. We
map $n$ vectors from a higher
$(d + \sum_{i=1}^{|S_U|} \sum_{j} |a_i = v_j|)$-dimensional space for
users (replace $|S_U|$ with $|S_I|$ for items) to a lower
$d'$-dimensional space. Similar to Algorithm 1, we consider $l$
LSH hash functions and then post-process the buckets for
satisfiability of the remaining constraints in order to retrieve
the final result set of tagging action groups $G^{app}$ with
maximum optimization score. Problem 1 in Table 1 has all three
tagging action components set to similarity. In this case, we
build one long vector for each tagging action group $g_x \in G$ by
concatenating the boolean vector corresponding to categorical
user attributes, the boolean vector corresponding to categorical
item attributes and the numeric tag topic signature vector
$T_{rep}(g_x)$. The dimensionality of the high-dimensional space
for Problem 1 is

$$d + \sum_{i=1}^{|S_U|} \sum_{j} |a_i = v_j| + \sum_{i=1}^{|S_I|} \sum_{j} |a_i = v_j|.$$
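The folding step amounts to appending a boolean encoding of the categorical attributes to the tag signature vector before hashing. The sketch below illustrates this for user attributes only; the names `one_hot`, `fold` and `USER_SCHEMA`, and the reduced set of attribute values, are assumptions made for the example.

```python
def one_hot(attributes, schema):
    """Encode categorical attributes as a flat boolean (0.0/1.0) vector.

    attributes: dict mapping attribute name -> value for one group.
    schema: dict mapping attribute name -> ordered list of possible values.
    """
    bits = []
    for name, values in schema.items():
        for value in values:
            bits.append(1.0 if attributes.get(name) == value else 0.0)
    return bits


def fold(tag_signature, user_attrs, user_schema):
    """Concatenate the tag signature with the boolean user attribute vector."""
    return list(tag_signature) + one_hot(user_attrs, user_schema)


# Hypothetical, heavily reduced schema: 2 gender values and 3 age ranges.
USER_SCHEMA = {
    "gender": ["male", "female"],
    "age": ["under 18", "18-24", "56+"],
}
folded = fold([0.6, 0.3, 0.1], {"gender": "male", "age": "18-24"}, USER_SCHEMA)
```

With a 3-dimensional tag signature and 5 attribute values in total, the folded vector has 3 + 5 = 8 dimensions, matching the dimensionality formula above. Groups that agree on user attributes share the same boolean suffix and are therefore closer in angle.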

Complexity Analysis: The pre-processing time and
search time of the complete LSH based algorithms continue
to be $O(n l d' \log n)$ and $O(n \log n)$ respectively. The space
complexity of the algorithms is $O(nl)$.

Both SM-LSH-Fi and SM-LSH-Fo are efficient algorithms
for solving TagDM similarity maximization problem instances
and readily outperform the baseline Exact, as shown in
Section 6. However, there are other instantiations, namely
Problems 4, 5 and 6 in Table 1, which concern tag diversity
maximization. Since it is non-obvious how the hash
function may be inverted to account for dissimilarity while
preserving the properties of LSH, we develop another set of
algorithms (less efficient than the LSH based ones, as per
complexity analysis) in Section 5 for diversity problems.

5. FDP BASED ALGORITHMS 

The second of our algorithmic solutions borrows ideas
from techniques employed in computational geometry, which
model data objects as points in high dimensional space and
determine a subset of points optimizing some objective
function. Examples of such geometric problems include
clustering a set of points in Euclidean space so as to minimize
the maximum intercluster distance, computing the $k$-th smallest
or largest inter-point distance for a finite set of points in
Euclidean space, etc. Since we consider tagging action groups
as tag signature vectors, and since the cardinality of the
global set of topics (that, in turn, determines the size of
each vector) is often high, a computational geometry based
approach is an intuitive choice to pursue.

We focus on a specific geometric problem, namely the 
facility dispersion problem (FDP), which is analogous to 
our TagDM problem instances, finding the tagging action 
groups maximizing the mining criterion. The facility dis- 
persion problem deals with the location of facilities on a 
network in order to maximize distances between facilities, 
minimize transportation costs, avoid placing hazardous ma- 
terials near housing, outperform competitors' facilities, etc. 
We consider the problem variant in Ravi et al.'s paper [18]
that maximizes some function of the distances between
facilities. The optimality criteria considered in the paper are
MAX-MIN (i.e., maximize the minimum distance between
a pair of facilities) and MAX-AVG (i.e., maximize the
average distance between a pair of facilities). Under either
criterion, the problem is known to be NP-hard by reduction
from the Set Cover problem, even when the distances satisfy
the triangle inequality [7]. The authors present an
approximation algorithm for the MAX-AVG dispersion problem
that provides a performance guarantee of 4. The algorithm
starts with a pair of nodes (i.e., facilities) joined by
an edge of maximum weight and, in each subsequent iteration,
adds the node that has the maximum distance to the
nodes already selected.

The facility dispersion problem solution provides an
approach to determine a set of tagging action groups that
have maximum average pairwise distance, i.e., that are
dissimilar in their tagging behavior. In fact, this approach may
also be extended to determine a set of tagging action groups
that are similar in their behavior, unlike the LSH based
algorithm in Section 4 (which works only for similarity, not
diversity). We consider each of the $n$ input tagging action
groups as a $d$-dimensional tag signature vector in a unit
hypercube and intend to identify $k$ vectors with maximum average
pairwise distance between them. We compare the input set
$G$ of $n$ tagging action groups using a pairwise comparison
function $F_p(g_1, g_2, tags, diversity)$ that operates on tagging
action group signature vectors, and return the set of at most
$k$ tagging action groups having maximum diversity in tagging
behavior.

Our FDP based algorithms work for Problems 4, 5 and 6 
in Table 1 maximizing tag diversity. We first introduce an 
algorithm that returns the groups having maximum diver- 
sity in tagging behavior (Column O in Table 1) and then 
discuss additional techniques to handle the multiple hard 
constraints in the solution (Column C in Table 1). 

5.1 Maximizing Diversity based on FDP 

Our FDP based algorithm DV-FDP handles TagDM
problem instances optimizing tag Diversity. Given an input
set $G$ of $n$ tagging action groups, each having a numeric
tag signature vector $T_{rep}(g_x)$, where $g_x \in G$, we build the
result set $G^{app}$ (we write $G^{app}$ since the technique returns
an approximate solution) by adding, in each iteration, the
tagging action group that has the maximum distance to the
groups already included in the result set. Again, we use the
cosine similarity measure between two tag signature vectors
for determining the distance, since this distance metric
satisfies the triangle inequality. Thus, DV-FDP attempts to
find one tight set of $k$ groups with maximum average pairwise
distance between them. The approximation bound for this
algorithm follows from [18]:

Theorem 4. Let $I$ be an instance of the TagDM problem
maximizing the mining criterion with $k \ge 2$ and no other
hard constraints, where the collection of $n$ $d$-dimensional
vectors are in a unit hypercube satisfying the triangle
inequality. Let $G^{opt}$ and $G^{app}$ denote the sets of $k$ tagging
action groups returned by the Exact and DV-FDP algorithms
respectively. Then $G^{opt}/G^{app} \le 4$.

Algorithm 2 is the pseudo-code of our DV-FDP algorithm.
Once the $n \times n$ distance matrix $S$ is built using the cosine
distance function, the implementation exhaustively scans $S$
to determine the best add operation in each of the subsequent
iterations. If $A$ represents the result set, the objective
is to find an entry from $G - A$ to add to $A$ such that the
total sum of its weights to the nodes in $A$ is maximum.



Algorithm 2 DV-FDP (G, O, k): G^app

//Main Algorithm
1: S_G ← Compute n × n Distance Matrix(G)
2: {g_x, I_x, g_y, I_y} ← MAX(S_G)
3: A ← [g_x, g_y]
4: while |A| ≠ k do
5:     g_z ← the z ∈ [G − A] maximizing Σ_{z' ∈ A} S_G(z, z')
6:     A ← [A, g_z]
7: end while
8: G^app ← A
9: return G^app
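A runnable sketch of this greedy procedure is given below. It is an illustration under stated assumptions rather than the authors' implementation: the distance is taken to be one minus cosine similarity, ties are broken by index order, and the names `cosine_distance` and `dv_fdp` are introduced here.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Assumed distance: 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def dv_fdp(vectors, k):
    """Greedy MAX-AVG dispersion; returns the indices of k selected vectors."""
    n = len(vectors)
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    # Build the n x n distance matrix once.
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    # Seed the result with the pair joined by the largest distance.
    first, second = max(combinations(range(n), 2), key=lambda p: dist[p[0]][p[1]])
    selected = [first, second]
    while len(selected) < k:
        remaining = [i for i in range(n) if i not in selected]
        # Add the group whose summed distance to the selected groups is largest.
        selected.append(max(remaining, key=lambda i: sum(dist[i][j] for j in selected)))
    return selected


# Toy tag signature vectors (hypothetical): vectors 0 and 1 are nearly parallel.
example = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
chosen = dv_fdp(example, k=3)
```

On the toy input the near-duplicate vector at index 1 is left out, since it adds little distance to a set that already contains index 0.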

Complexity Analysis: The complexity of the implementation
of the DV-FDP algorithm is $O(n + nk)$, i.e., $O(n^2)$, due to
operations around the $n \times n$ distance matrix $S$. The space
complexity of the algorithm is $O(n^2)$. Note that our LSH
based algorithms have better space and time complexity than
FDP based algorithms. However, experiments in Section 6 show
comparable execution time for LSH and FDP based algorithms
in a practical setting.

Like SM-LSH, this algorithm does not yet take the hard
constraints along user and item dimensions into account in the
optimization solution. We now illustrate approaches for
including the multiple hard constraints in the solution.

5.2 Dealing with Constraints: Filtering 

Similar to SM-LSH-Fi, a straightforward method of refining
the result set of groups for satisfiability of all the hard
constraints in TagDM problem instances is post-processing,
or Filtering. We refer to this algorithm as DV-FDP-Fi.
Once the result set $G^{app}$ of $k$ groups is identified, we
post-process it to retrieve the relevant answer set of tagging
action groups satisfying all the hard constraints. Note that
such post-processing of the result set for satisfiability of hard
constraints may return null results frequently, and hence we
propose a smarter algorithm that folds some of the hard
constraints into the DV-FDP approach, thereby decreasing
the chances of hitting a null result.

5.3 Dealing with Constraints: Folding 

In contrast to the general DV-FDP algorithm, whose objective
is to add groups to the result set greedily so that
average pairwise distance is maximized, we want to retrieve
in each iteration a set whose members, besides being
dissimilar, satisfy the other constraints. In DV-FDP,
the greedy add operation in Line 5 of Algorithm 2 maximizes
tag diversity. If the algorithm adds a bad tagging
action group to the result set in an iteration, it may return
a null result or an inferior approximate result after final
filtering of the result set for hard constraint satisfiability.
Therefore, we propose our second approach, in which multiple
hard constraints are Folded into the add operation. We refer
to this algorithm as DV-FDP-Fo. During each new group
addition to the result set, we not only check for the pair with
maximum distance, but also check for the satisfiability of the
hard constraints $F_p(g_1, g_2, users, m) \ge q$ and
$F_p(g_1, g_2, items, m) \ge r$, where
$m \in \{similarity, diversity\}$. The algorithm terminates
when the number of groups in the result set equals $k$. Once the
result set of $k$ groups is identified, we post-process the set for
satisfiability of the support constraint in order to retrieve
the answer set of tagging action groups $G^{app}$.
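One way to picture the folded add operation is a greedy loop that only considers candidates passing a constraint check. The sketch below is an assumption-laden illustration: `dv_fdp_fold` and the `satisfies` predicate are hypothetical names, and the predicate stands in for the checks $F_p(\cdot, \cdot, users, m) \ge q$ and $F_p(\cdot, \cdot, items, m) \ge r$, whose exact form is not reproduced here.

```python
def dv_fdp_fold(dist, k, satisfies):
    """Greedy dispersion with hard constraints folded into each add step.

    dist: n x n distance matrix (list of lists).
    satisfies(candidate, selected): assumed predicate returning True when
    adding `candidate` keeps the user/item constraints satisfied with
    respect to the already selected groups.
    Returns k indices, or None if the greedy walk runs out of candidates.
    """
    n = len(dist)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n) if satisfies(j, [i])]
    if not pairs:
        return None
    first, second = max(pairs, key=lambda p: dist[p[0]][p[1]])
    selected = [first, second]
    while len(selected) < k:
        candidates = [i for i in range(n)
                      if i not in selected and satisfies(i, selected)]
        if not candidates:
            return None
        selected.append(max(candidates, key=lambda i: sum(dist[i][j] for j in selected)))
    return selected


# Toy setup (hypothetical): group 2 is the farthest from everything but has a
# different gender, and the folded constraint demands identical gender.
genders = ["m", "m", "f", "m"]
toy_dist = [[0, 1, 5, 2],
            [1, 0, 5, 3],
            [5, 5, 0, 5],
            [2, 3, 5, 0]]


def same_gender(candidate, selected):
    return all(genders[candidate] == genders[s] for s in selected)


picked = dv_fdp_fold(toy_dist, 3, same_gender)
```

Group 2 would dominate the unconstrained greedy choice, yet it is never added here because it fails the folded constraint; filtering after the fact would instead have discarded the whole result.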

Complexity Analysis: The time and space complexity
of the algorithm continues to be $O(n^2)$ in the worst case.

Discussion: Table 2 broadly summarizes our algorith- 
mic contributions for solving the TagDM problem instances 
in Table 1. Note that, our algorithms are capable of han- 
dling all 112 concrete problem instances that our framework 
captures. 



Optimization | Algorithm | Constraints           | Additional Techniques
-------------+-----------+-----------------------+----------------------------------------------------------
similarity   | LSH based | similarity            | fold constraints
             |           | diversity             | filter constraints
             |           | similarity, diversity | fold similarity constraints, filter diversity constraints
diversity    | FDP based | similarity            | fold constraints
             |           | diversity             | fold constraints
             |           | similarity, diversity | fold constraints



Table 2: Summary of TagDM Problem Solutions. 



6. EXPERIMENTS 

We conduct a set of comprehensive experiments for
quantitative (Section 6.1) and qualitative (Section 6.2) analysis of
our proposed algorithms for all 6 problems listed in Table 1.
Our quantitative performance indicators are (a) efficiency of
the algorithms, and (b) analysis quality of the results
produced. The efficiency of our algorithms is measured by the
overall response time, whereas the result quality is measured
by the average pairwise distance between the $k$ tagging
action group vectors returned by our algorithms (i.e., $F_{pa}$).
In order to qualitatively assess the tagging behavior analysis
generated by our approaches, we conduct a user study
through Amazon Mechanical Turk and present interesting
case studies.

Data Set: We use the MovieLens^3 1M and 10M ratings
datasets for our evaluation. The MovieLens 1M
dataset consists of 1 million ratings from 6000 users on 4000
movies, while the 10M version has 10 million ratings and
100,000 tagging actions applied to 10,000 movies by 72,000
users. The titles of movies in MovieLens are matched with
those in the IMDB dataset^4 to obtain movie attributes.

User Attributes: The 1M dataset has well-defined user
attributes but no tagging information, whereas the 10M
dataset has tagging information but no user attributes.
Therefore, for each user in the 1M dataset with a complete
set of attributes, we build her rating vector and compare it
to the rating vectors (if available) of all 72,000 users in the
10M dataset. For every user in the 10M dataset, we find the
user in the 1M dataset whose movie rating vector has the
highest cosine similarity to hers (i.e., whose rating behavior
is most similar). The attributes of a user in the 10M dataset
are then copied from this closest user in the 1M dataset. In
this way, we build a dataset consisting of 33,322 tagging and
rating actions applied to 6,258 movies by 2,320 users. The tag
vocabulary size is 64,663. The user attributes are gender, age,
occupation and zip-code. The attribute gender takes 2
distinct values: male or female. The attribute age is chosen
from one of the eight age ranges: under 18, 18-24, ..., 56+.
There are 21 different occupations listed by MovieLens, such
as student, artist, doctor, lawyer, etc. Finally, we convert
zip codes to states in the USA (or foreign, if not in the USA)
by using the USPS zip code lookup^5. This produces the user
attribute location, which takes 52 distinct values.
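The attribute transfer between the two datasets can be sketched as a nearest-neighbor lookup over rating vectors. The code below is a simplified illustration: ratings are stored as sparse dictionaries, and the names `transfer_attributes`, `users_1m` and `users_10m` as well as the toy data are assumptions.

```python
import math


def cosine(r1, r2):
    """Cosine similarity of two sparse rating vectors (dict: movie id -> rating)."""
    common = set(r1) & set(r2)
    dot = sum(r1[m] * r2[m] for m in common)
    norm1 = math.sqrt(sum(x * x for x in r1.values()))
    norm2 = math.sqrt(sum(x * x for x in r2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def transfer_attributes(users_10m, users_1m, attrs_1m):
    """For each 10M user, copy the attributes of the most similar 1M user."""
    assigned = {}
    for uid, ratings in users_10m.items():
        closest = max(users_1m, key=lambda other: cosine(ratings, users_1m[other]))
        assigned[uid] = attrs_1m[closest]
    return assigned


# Toy data (hypothetical ids, ratings and attributes).
users_1m = {"a": {1: 5, 2: 1}, "b": {1: 1, 2: 5}}
attrs_1m = {"a": {"gender": "male"}, "b": {"gender": "female"}}
users_10m = {"x": {1: 4, 2: 1}, "y": {2: 4, 3: 2}}
transferred = transfer_attributes(users_10m, users_1m, attrs_1m)
```

A linear scan like this costs one similarity computation per pair of users; for the dataset sizes quoted above that is manageable, but the sketch makes no attempt at efficiency.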

Movie Attributes: Movie attributes are genre, actor 
and director. There are 19 movie genres such as action, 
animation, comedy, drama, etc. The pool of actor values and 
director values, corresponding to movies which have been 
rated by at least one user in the MovieLens dataset, is huge. 
We pick only those actors and directors who belong to at 
least one movie that has received greater than 5 tagging 
actions. In our experiments, the number of distinct actor 
attribute values is 697 while that of distinct director is 210. 

3 http://www.grouplens.org/node/73

4 http://www.imdb.com/interfaces

5 http://zip4.usps.com

Figure 3: Execution Time: Problems 1, 2, 3 in Table 1
Figure 4: Quality: Problems 1, 2, 3 in Table 1
Figure 5: Execution Time: Problems 4, 5, 6 in Table 1
Figure 6: Quality: Problems 4, 5, 6 in Table 1

Mining Functions: The set of tagging action groups is
built by performing a cartesian product of user attribute
values with item attribute values. An example tagging action
group is {gender=male, age=under 18, occupation=student,
location=new york, genre=action, actor=tom hanks,
director=steven spielberg}. The total number of possible
tagging action groups is more than 40 billion, while
the number of tagging action groups containing at least
one tuple is over 300K. For our experiments, we consider
4535 groups that contain at least 5 tagging action tuples.
The user and item similarity (or diversity) is measured by
determining the structural distance between user and item
descriptions of groups respectively. For topic discovery, we
apply LDA [3] as discussed in Section 2.1.2. We generate a
set of 25 global topic categories for the entire dataset, i.e.,
$d = 25$. For each tagging action group, we perform LDA
inference on its tag set to determine its topic distribution
and then generate its tag signature vector of length 25.
Finally, we use cosine similarity for computing pairwise
similarity between tag signature vectors.
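The grouping and support filtering described above can be sketched as follows. For brevity the example assumes that every tagging tuple carries a single value per attribute and uses only two attributes; the name `build_groups` and the record layout are hypothetical.

```python
from collections import defaultdict


def build_groups(tagging_tuples, attribute_names, min_support=5):
    """Group tagging tuples by their attribute values and keep frequent groups.

    tagging_tuples: list of dicts with one value per attribute plus a "tags" list.
    attribute_names: ordered attribute names that describe a group.
    Returns a dict mapping the group description (tuple of values) to the
    list of tag lists of its tuples, for groups with >= min_support tuples.
    """
    groups = defaultdict(list)
    for record in tagging_tuples:
        key = tuple(record[name] for name in attribute_names)
        groups[key].append(record["tags"])
    return {key: tags for key, tags in groups.items() if len(tags) >= min_support}


# Toy input (hypothetical): 5 tuples in one group, 2 in another.
toy_tuples = (
    [{"gender": "male", "genre": "action", "tags": ["gun fights"]}] * 5
    + [{"gender": "female", "genre": "drama", "tags": ["moving"]}] * 2
)
frequent = build_groups(toy_tuples, ["gender", "genre"], min_support=5)
```

Only groups that actually occur in the data are materialized, which is why the number of non-empty groups stays far below the size of the full cartesian product.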

System Configuration: Our prototype system is imple- 
mented in Python. All experiments were conducted on an 
Ubuntu 11.10 machine with 4 GB RAM, AMD Phenom II 
N930 Quad-Core Processor. 

6.1 Quantitative Evaluation 

We compare the execution time of all 6 TagDM problem 
instantiations in Table 1 for the entire dataset (consisting 
of 33K tuples and 4K tagging action groups) using Exact, 
SM-LSH-Fi, SM-LSH-Fo, DV-FDP-Fi and DV-FDP-Fo al- 
gorithms. We use the name Exact for the brute-force ap- 
proach on both tag similarity and diversity maximization 
instances. For all our experiments, we set the number of 



tagging action groups to be returned at $k = 3$, since the
Exact algorithm is not scalable for larger $k$. Figures 3 and 4
compare the execution time and quality respectively of Exact
and LSH based algorithms for Problems 1, 2 and 3 (Tag
Similarity). Figures 5 and 6 compare the execution time and
quality respectively of Exact and FDP based algorithms for
Problems 4, 5 and 6 (Tag Diversity). The quality of the
result set is measured by computing the average pairwise
cosine similarity between the tag signature vectors of the
$k = 3$ tagging action groups returned. The group support
is set at $p = 350$ (i.e., 1%); the user attribute similarity (or
diversity) constraint and the item attribute similarity
(or diversity) constraint are set to $q = 50\%$ and $r = 50\%$
respectively. For LSH based algorithms, the number of hash
tables is $l = 1$ while the initial value of $d'$ is 10.

We observe that our algorithms run much faster than Exact
for both tag similarity and tag diversity problem instances.
In Figure 3, the execution times of SM-LSH-Fi and SM-LSH-Fo
for Problems 1, 2 and 3 are comparable to each other and are
less than 1 minute. In Figure 5, the execution times of
DV-FDP-Fi and DV-FDP-Fo for Problems 4, 5 and 6 are slightly
more than 3 minutes. Despite the significant reduction in
execution time, our algorithms do not compromise much in terms
of analysis quality, as evident from Figure 4 and Figure 6.

The number of input tagging action tuples available for
tagging behavior analysis depends on the query under
consideration. For the entire dataset, there are 33K such
tuples. However, if we want to perform tagging behavior
analysis of all movies tagged by {gender=male} or
{location=CA}, the number of available tuples is 26,229 and
6,256 respectively. Or, if we want to perform tagging behavior
analysis of all users who have tagged movies having
{genre=drama}, the number of tuples is 17,368. The number of
tagging action tuples can have a significant impact on the
performance of the algorithms since it affects the number of
non-empty tagging action groups on which our algorithms
operate. We therefore build 4 bins having 30K, 20K, 10K and
5K tagging action tuples respectively (assume each bin is the
result of some query on the entire dataset) and compare our
algorithms' performance for one of the tag similarity
maximization problems and one of the tag diversity
maximization problems, namely Problem 1 and Problem 6 from
Table 1 respectively. Both Problems 1 and 6 have user and
item dimension constraints set to similarity. Figures 7 and 8
compare the execution time and quality respectively of the
Exact brute-force algorithm with our smart algorithms:
SM-LSH-Fo for Problem 1 and DV-FDP-Fo for Problem 6. The
group support is set at $p = 350$ (i.e., 1%); the user attribute
similarity (or diversity) constraint and the item attribute
similarity (or diversity) constraint are set to $q = 50\%$ and
$r = 50\%$ respectively, and $k = 3$. For each bin along the X
axis, the first two vertical bars stand for Problem 1 (tag
similarity) and the last two stand for Problem 6 (tag diversity).

As expected, the difference in execution time between our
algorithms and Exact is small for bins with fewer tagging
tuples, for both tag similarity and diversity. However, our
algorithms return results much faster than Exact for bins
with a larger number of tagging tuples. The quality
scores continue to be comparable to the optimal answer, as
shown in Figure 8.

6.2 Qualitative Evaluation 

We now validate how social tagging behavior analysis can 
help users spot interesting patterns and draw conclusions 
about the desirability of an item, by presenting several anec- 
dotal results on real data. We also compare the utility and 
popularity of the 6 novel mining problems in Table 1 in an 
extensive user study conducted on Amazon Mechanical Turk 
(AMT)^6.

6.2.1 Case Study

We present a few interesting anecdotal results returned by
our algorithms for the following randomly selected queries:

1. Analyze user tagging behavior for {director=Steven
Spielberg, genre=war} movies: Old male and young female
users use diverse sets of tags for the war movies
"Saving Private Ryan" and "Schindler's List" directed
by Steven Spielberg. This is because the former is
a movie about the US military while the latter revolves
around the German military in World War II. Also, old
male and young male users tag "Schindler's List"
dissimilarly: the former like it while the latter do not.

2. Analyze tagging behavior of {gender=male,
location=California} users for movies:
Old male and young male users living in California use
similar tags for the "Lord of the Rings" film trilogy of
the fantasy genre. However, they differ in their tagging
of the "Star Wars" movies, which have a similar genre.
This is because the genre of the latter series borders
between fantasy and science fiction. Surprisingly, old
male users like it while young male users do not.

6.2.2 User Study 

We conduct a user study through Amazon Mechanical 
Turk to elicit user responses towards the different TagDM 
problem instances we have focused on in the paper. We gen- 
erate analysis corresponding to all 6 problem instantiations 
for the following randomly selected queries: 

1. Analyze tagging behavior of {gender=male} users for
movies.

2. Analyze tagging behavior of {occupation=student}
users for movies.

3. Analyze user tagging behavior for {genre=drama}
movies.

We have 30 independent single-user tasks. Each task is 
conducted in two phases: User Knowledge Phase and User 
Judgment Phase. During the first phase, we estimate the 
user's familiarity about movies in the task using a survey, 
besides her demographics. In the second phase, we ask users 
to select the most preferred analysis, out of the 6 presented 
to them, for each query. Responses from all users are aggre- 
gated to provide an overall comparison between all problem 
instances in Figure 9. The height of each vertical bar represents
the percentage of users who prefer a problem instance.
It is evident that users prefer TagDM Problems 2 (find similar
user sub-populations who agree most on their tagging
behavior for a diverse set of items), 3 (find diverse user sub-
populations who agree most on their tagging behavior for a
similar set of items) and 6 (find similar user sub-populations
who disagree most on their tagging behavior for a similar set
of items), which have diversity as the measure for exactly one of
the tagging components: item, user and tag, respectively.

Figure 9: User Study
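
The aggregation behind Figure 9 can be sketched as follows. This is a
minimal illustration, assuming each response is recorded as the number
(1 to 6) of the preferred problem instance. The function name
preference_percentages and the sample responses are hypothetical and
are not the data collected in the study.

```python
from collections import Counter


def preference_percentages(responses, num_instances=6):
    """Return the percentage of responses that prefer each problem instance.

    `responses` holds one integer in 1..num_instances per judgment.
    This record layout is an assumption made for illustration.
    """
    if not responses:
        raise ValueError("no responses to aggregate")
    counts = Counter(responses)
    total = len(responses)
    # Instances that nobody selected still appear, with 0.0 percent.
    return {
        instance: 100.0 * counts.get(instance, 0) / total
        for instance in range(1, num_instances + 1)
    }


# Hypothetical responses, not data from the study.
sample = [2, 3, 6, 2, 2, 3, 6, 1, 2, 3]
percentages = preference_percentages(sample)
```

Each value in the returned dictionary corresponds to the height of one
bar in a chart of this kind.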

7. RELATED WORK 

To the best of our knowledge, our work is the first to develop
a general framework for mining collaborative tagging actions,
study its complexity, and develop
efficient algorithms. We summarize work related to topic
discovery, tag mining and its applications, and the heuris- 
tics we use in our algorithms. 

There are many topic discovery techniques such as 
tf*idf [19], Latent Dirichlet Allocation (LDA) [3, 2] and 
OpenCalais. In this work, we use LDA, a generative proba- 
bilistic method proven to be robust when looking for hidden 
topics in Web documents [3, 1]. 

Tag mining has been used in multiple applications includ- 
ing tag recommendations [17], item recommendations [16, 
10], document navigation [11], and tagging motivation [15].
However, most of these works are tailored to specific datasets 
and none of them defines a general mining problem, studies 
its complexity and develops efficient generic algorithms. 

Locality Sensitive Hashing (LSH) and the Facility Dispersion
Problem (FDP) were first introduced in [13, 8] and [12],
respectively. LSH is used in prominent applications including
duplicate detection and nearest neighbor queries [13]. In
this work, we show how we adapt LSH to rank and choose
the best bucket containing tagging analysis results. While
less efficient than LSH, the computational geometry
based approach for the facility dispersion problem in [18]
serves tag diversity problem instantiations and may be extended
to solve similarity problems.
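
To make the two heuristics concrete, the sketch below shows a textbook
random-hyperplane LSH for cosine similarity and a generic greedy
max-min heuristic for dispersion. It is an illustration under stated
assumptions, not the algorithm of [13], [18] or of this paper: the
function names lsh_buckets and greedy_dispersion, the vector
representation of user groups, and the sample data are all
hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def lsh_buckets(vectors, num_planes=8, seed=0):
    """Group vectors by the sign pattern of their projections onto
    random hyperplanes; vectors with a small angle tend to collide."""
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(num_planes)]
    buckets = defaultdict(list)
    for name, vec in vectors.items():
        # One bit per hyperplane: which side of the plane the vector lies on.
        key = tuple(sum(p * x for p, x in zip(plane, vec)) >= 0.0
                    for plane in planes)
        buckets[key].append(name)
    return dict(buckets)


def greedy_dispersion(points, k):
    """Pick k mutually distant points: start from the farthest pair, then
    repeatedly add the point whose nearest chosen point is farthest away."""
    if not 2 <= k <= len(points):
        raise ValueError("k must be between 2 and the number of points")

    def dist(a, b):
        return math.dist(points[a], points[b])

    names = list(points)
    chosen = list(max(combinations(names, 2), key=lambda pair: dist(*pair)))
    while len(chosen) < k:
        rest = [n for n in names if n not in chosen]
        chosen.append(max(rest, key=lambda n: min(dist(n, c) for c in chosen)))
    return chosen


# Hypothetical group vectors: g2 points the same way as g1, g3 the opposite way.
groups = {"g1": [1.0, 2.0, 0.0], "g2": [2.0, 4.0, 0.0], "g3": [-1.0, -2.0, 0.0]}
buckets = lsh_buckets(groups)

# Hypothetical one-dimensional points for the dispersion heuristic.
points = {"a": (0.0,), "b": (1.0,), "c": (2.0,), "d": (10.0,)}
diverse = greedy_dispersion(points, 3)
```

Vectors that share a bucket are candidates for a similarity result,
while the greedy selection favors points that are far apart, which
matches the role diversity plays in the framework.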

8. CONCLUSION 

In this paper, we developed the first framework to mine 
social tagging behaviors. We identified a family of min- 
ing problems that apply two opposing measures: similarity 
and diversity, to the three main tagging components: users, 
items, and tags. We showed that any instance of those is 
NP-Complete and developed efficient algorithms based on 
locality sensitive hashing and solutions developed in computational
geometry for the facility dispersion problem. Our
extensive experiments on the MovieLens dataset show the
superiority of our algorithms over the brute-force approach. 
In the future, we plan to handle updates and insertions of 
new users, items and tags. We also intend to explore the ap- 
plicability of our framework to other domains such as topic- 
centric exploration of tweets and news articles, an area that 
has been receiving a lot of attention lately. In particular, 
we would like to explore the usefulness of our techniques for 
mining and characterizing events in tweets and news. 